Sanitization and Anonymization of Document Repositories
Authors
Abstract
Information security and privacy on the World Wide Web (WWW) is an important issue that is still being investigated. However, most current research deals with access control and authentication-based trust. With the popularity of the WWW as one of the largest information sources, the privacy of individuals is now as important as the security of information. In this chapter, our focus is text, which is probably the most frequently seen data type on the WWW. Our aim is to highlight the possible threats to privacy that arise from the availability of document repositories and of sophisticated tools to browse and analyze these documents. We first identify possible threats to privacy in document repositories. We then discuss a measure of privacy in documents, along with some possible solutions to avoid or at least alleviate these threats.
Similar resources
t-Plausibility: Generalizing Words to Desensitize Text
De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information ...
Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables
In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. As an example document set, we use a subset of the Wikileaks Cables, made up of documents relating to five key news items which were revealed by the cables. In order to sanitize the documents we have developed a semi-auto...
Additive Sanitization: A Technique for Pattern-Preserving Anonymization for Time-Series Data
A time series is a set of data normally collected at regular intervals and often contains a large amount of private information about individuals. Such data must be anonymized to protect privacy while still supporting complex queries such as pattern range and pattern matching queries. The conventional (k, p)-anonymity model cannot effectively address this problem as it may suffer serious pattern loss. In...
An Information Retrieval Approach to Document Sanitization
In this paper we use information retrieval metrics to evaluate the effect of a document sanitization process, measuring information loss and risk of disclosure. In order to sanitize the documents we have developed a semiautomatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration. It embodies two main steps: (i) identifying and anonymizing sp...
Privacy, Anonymization, Anomaly Detection
The sharing of network traces is an important prerequisite for the development and evaluation of efficient anomaly detection mechanisms. Unfortunately, privacy concerns and data protection laws prevent network operators from sharing these data. Anonymization is a promising solution in this context; however, it is unclear if the sanitization of data preserves the traffic characteristics or intro...